- Recap
- Machine Learning
- Supervised vs. Unsupervised Learning
- Basic Stats
- Simple Linear Regression
- Multiple Regression
What is machine learning?
ML relies on computers to learn a mapping from inputs to outputs so they can produce predictions on brand-new data.
In regression we predict continuous values, for example house prices or tomorrow's temperature.
In contrast, classification predicts discrete values, for example spam vs. not spam, or which species a flower belongs to.
What we have been covering so far is supervised learning! We give the computer/algorithm already labeled data so it knows what outcomes we want to predict.
When you have no labels, unsupervised learning comes into play. You still have features to work with but allow the algorithm to come up with patterns or structures.
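As a quick illustration, here is a minimal unsupervised sketch using base R's `kmeans` on the iris measurements: the algorithm is given only the features, never the `Species` labels, and discovers groupings on its own.

```r
# k-means looks for structure without ever seeing the Species labels
data(iris)
features <- iris[, c("Sepal.Length", "Sepal.Width",
                     "Petal.Length", "Petal.Width")]

set.seed(42)                        # k-means starts from random centers
km <- kmeans(features, centers = 3)

# A cluster id (1-3) for every row, found purely from the features
table(km$cluster, iris$Species)
```

Cross-tabulating the discovered clusters against the (withheld) species is a common sanity check for whether the unsupervised structure lines up with known labels.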
In our current data-rich times, we usually have the luxury of partitioning our data in two parts. The first is the training set, a subset (normally around 75-80%) of your total data used to fit your model; the second is the test set, the rest of your data, which you evaluate your model against to compare results.
We can add another partition called the validation set, which lets us tweak the model before we go on and test it against our held-out test set. That way we save our test data to confirm the final model.
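A minimal sketch of such a split in base R, assuming an 80/20 train/test partition of the built-in `iris` data (the proportion and seed are illustrative choices):

```r
# A hypothetical 80/20 split of iris into training and test sets
set.seed(1)                              # make the random split reproducible
n         <- nrow(iris)
train_idx <- sample(n, size = floor(0.8 * n))

train <- iris[train_idx, ]   # used to fit the model
test  <- iris[-train_idx, ]  # held out to evaluate the fitted model
```

A validation set can be carved out of `train` the same way if you need to tune the model before touching `test`.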
…simpler solutions are more likely to be correct than complex ones. When presented with competing hypotheses to solve a problem, one should select the solution with the fewest assumptions
The less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the sample.
Error due to bias is the amount by which the expected model prediction differs from the true values of the training data. It is introduced by approximating a complicated real-world process with a much simpler model. High-bias algorithms are easier to learn but less flexible, and because of this they have lower predictive performance on complex problems. Linear algorithms and oversimplified models lead to high bias.
Error due to variance is the amount by which the prediction fit on one training set differs from the expected prediction over all training sets. Different training sets will produce different estimates, but ideally the estimate should not vary too much between them. If a method has high variance, small changes in the training data can result in large changes in the results.

Correlation is a measure of the strength and direction of the linear relationship between two variables. It ranges between -1 and 1, with 0 meaning no linear relationship.
\[r_{xy} =\frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)s_xs_y}\]
Covariance is the joint variability of two random variables. Unlike correlation, which is dimensionless, covariance is expressed in units obtained by multiplying the units of the two variables. \[\text{Cov}(X,Y) =\frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{n-1}\]
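The two formulas above can be checked directly against R's built-in `cov()` and `cor()`; a small sketch using two iris columns:

```r
# Compare the textbook formulas with R's built-in cov() and cor()
x <- iris$Petal.Length
y <- iris$Sepal.Length
n <- length(x)

# Covariance: sum of cross-deviations divided by n - 1
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation: covariance rescaled by both standard deviations
cor_manual <- cov_manual / (sd(x) * sd(y))

all.equal(cov_manual, cov(x, y))
all.equal(cor_manual, cor(x, y))
```

Dividing the covariance by \(s_x s_y\) is exactly what makes correlation dimensionless.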

This is one of the most foundational models in machine learning, if not the most. Despite its stark simplicity, it is extremely powerful. Many of you have already encountered this tool growing up or in elementary statistics courses. You've seen this equation:
\[ y =mx+b\]
where:
- \(m\) is the slope of the line
- \(b\) is the y-intercept
In machine learning, we write \(y = mx+b\) slightly differently.
\[ y = \beta_0 + \beta_1x_1\] where:
- \(\beta_0\) is the intercept
- \(\beta_1\) is the regression coefficient (the slope)
- \(x_1\) is the predictor variable
The errors in this case are the deviations, or vertical distances, from the regression line to the actual data.
\[e_i = y_i - \hat{y_i}\]

In this calculation we are using \(n-\text{df}\) since we have to account for degrees of freedom.
\[\text{MSE} = \frac{\sum_{i=1}^n (y_i-\hat{y_i})^2}{n-\text{df}}\]

## Root Mean Square Error
\[\text{s} = \sqrt{\text{MSE}} \]
Errors and residuals are almost identical; the main difference lies in inference. If we have a fully known population, the deviation of the actual value from the predicted value is called an error, while if we take a sample and compute the deviations, they are called residuals.
OLS is the most common method for fitting a regression line. It allows us to calculate the best-fitting line over the observed data. The criterion is to minimize the sum of the squared errors, and since the deviations are squared first, there are no cancellations between positive and negative errors.
To calculate the regression coefficient \(\beta_1\), the equation is:
\[\beta_1 = \frac{\sum_{i=1}^n (y_i-\bar{y})(x_i-\bar{x})}{\sum_{i=1}^n (x_i-\bar{x})^2} = \frac{\text{Cov}(x,y)}{\text{Var}(x)}\]
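Since the line passes through \((\bar{x}, \bar{y})\), the intercept follows as \(\beta_0 = \bar{y} - \beta_1\bar{x}\). A sketch computing both closed-form coefficients on the iris columns used below, checked against `lm()`:

```r
# Slope and intercept from the closed-form OLS formulas
x <- iris$Petal.Length
y <- iris$Sepal.Length

beta1 <- cov(x, y) / var(x)         # Cov(x, y) / Var(x)
beta0 <- mean(y) - beta1 * mean(x)  # line passes through (x-bar, y-bar)

coef(lm(y ~ x))                     # same two numbers from lm()
```

Agreement with `lm()` confirms that the built-in fitter is doing exactly this least-squares calculation in the simple one-predictor case.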
irisMod <- lm(Sepal.Length~Petal.Length, iris)
summary(irisMod)
Call:
lm(formula = Sepal.Length ~ Petal.Length, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.24675 -0.29657 -0.01515 0.27676 1.00269
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.30660 0.07839 54.94 <2e-16 ***
Petal.Length 0.40892 0.01889 21.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4071 on 148 degrees of freedom
Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
## MSE
sum(residuals(irisMod)^2) / df.residual(irisMod)
[1] 0.1657097
## RMSE
sqrt(sum(residuals(irisMod)^2) / df.residual(irisMod))
[1] 0.4070745
predict.lm(irisMod, data.frame(Petal.Length = 1:10))
1 2 3 4 5 6 7 8
4.715526 5.124448 5.533370 5.942293 6.351215 6.760137 7.169059 7.577982
9 10
7.986904 8.395826

There are often times when you need to model a phenomenon with more than one predictor variable. That's when you need to add more inputs. The general equation is:
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}+ \cdots + \beta_p x_{ip}\]
Multicollinearity occurs when two or more independent variables in a multiple regression model are highly linearly related. It often reduces the power of a model to identify independent variables that are statistically significant.
Structural multicollinearity is a mathematical artifact caused by creating new predictors from other predictors — such as, creating the predictor \(x^2\) from the predictor \(x\).
Data-based multicollinearity on the other hand, is a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.
When predictor variables are correlated, the estimated regression coefficient of any one variable depends on which other predictor variables are included in the model.
When predictor variables are correlated, the precision of the estimated regression coefficients decreases as more predictor variables are added to the model. Larger standard errors mean wider confidence intervals.
When predictor variables are correlated, the marginal contribution of any one predictor variable in reducing the error sum of squares varies depending on which other variables are already in the model. If one of the correlated variables is explaining the response variable, there is less room for the other correlated variable to explain the response.
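One common diagnostic for this is the variance inflation factor, \(\text{VIF}_j = 1/(1 - R^2_j)\), where \(R^2_j\) comes from regressing predictor \(j\) on the other predictors. A sketch computed by hand (no extra packages) for the two iris predictors used below; the rule-of-thumb threshold of 5 is a common convention, not a hard rule:

```r
# VIF for Petal.Length: regress it on the other predictor,
# then convert the resulting R-squared into 1 / (1 - R^2)
aux <- lm(Petal.Length ~ Petal.Width, data = iris)
r2  <- summary(aux)$r.squared
vif <- 1 / (1 - r2)
vif  # well above 5: these two predictors are strongly collinear
```

This is consistent with the wide standard errors and the sign flip on `Petal.Width` in the two-predictor fit below.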
The residuals should be spread (relatively) equally along the ranges of predictors. 
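A quick way to eyeball this in base R is a residuals-vs-fitted plot; a sketch for the simple model fitted earlier (refit here so the snippet stands alone):

```r
# Residuals vs. fitted values: the vertical spread should look
# roughly constant across the range of fitted values
irisMod <- lm(Sepal.Length ~ Petal.Length, data = iris)

plot(fitted(irisMod), residuals(irisMod),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # least-squares residuals center on zero

mean(residuals(irisMod))  # numerically zero by construction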
You can add new predictors by using \(+\) to append additional input variables.
irisMod2 <- lm(Sepal.Length~Petal.Length + Petal.Width, iris)
summary(irisMod2)
Call:
lm(formula = Sepal.Length ~ Petal.Length + Petal.Width, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.18534 -0.29838 -0.02763 0.28925 1.02320
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.19058 0.09705 43.181 < 2e-16 ***
Petal.Length 0.54178 0.06928 7.820 9.41e-13 ***
Petal.Width -0.31955 0.16045 -1.992 0.0483 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4031 on 147 degrees of freedom
Multiple R-squared: 0.7663, Adjusted R-squared: 0.7631
F-statistic: 241 on 2 and 147 DF, p-value: < 2.2e-16
library(plotly)
plot_ly(data = iris, x=~Petal.Length, y=~Sepal.Length,
z=~Petal.Width, type="scatter3d", mode="markers", color =~Species)